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Chapter 1 
Introduction 



Numerical simulations of jY-body interactions are extremely important in 
astrophysics, plasma physics, fluid dynamics, and molecular dynamics. To 
accomplish such a simulation by directly evaluating the interaction potentials 
requires computation time on the order of Of jV 2 ), which is prohibitively 
expensive for large A r . Consequently! there have been many efforts to develop 
algorithms whose complexity grows more slowly [LecT2]. These algorithm.? 
all make assumptions about the distribution of particles to approximate the 
interaction potentials, and thus are applicable only to certain Masses of jV- 
body interactions. 

The best-known "fast" algorithm is- the tree-cade [App£5j, which uses a 
tree- like data structure to form hierarchical clusters of particles. The algo- 
rithm approximates the force exerted by a cluster on a distant particle as 
the force exerted by an equivalent mass located the cluster's center of mass. 
The overall computational complexity of this algorithm is believed to be 
0{N [ogN). However, the force approximations ire effective when the par- 
ticle distribution is relatively nonuniform, and there are no known a priori 
estimates of the accuracy of these approximations , although methods have 
been -developed for use in practice [Por8r>] [0HS6J. 

rtokhlin and Greengard (GRS6J discovered an N- body algorithm with 
complexity Q(N), The algorithm separatea the computation of the far-field 
potentials from that of the near-field potentials, and expands the fax-field 
potentials into Taylor series. Significantly, given a required precision e, one 
can determine in advance how many terms must be retained in the Taylor 
ries expansions, and thus one can control the accuracy of the approximat 



sc- 
ion. 
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Rokhlin and Greengard's approach relies heavily upon analytical techniques 
to develop the series expansions, to derive error rati mar.es, and to shift se- 
ries expansions between coordinate frames. For the jV-body problem in two 
dimensions, this analysis is greatly simplified by modeling two-dimensional 
apace as the complex plane and using the theory of functions of a complex 
variable. 

This thesis applies Rokhlin and Greeugard's idea to develop ail O(N) 
algorithm for the three-dimensional xV-body problem. Although there is 
nu direct analogue of complex analysis in three dimensions, we are able to 
exploit the harmonic nature of the gravitational potential in order to produce 
suitable series expansions. Wt: implement the algorithm and experimentally 
Verity its accuracy and its. linear growth. We also implement a parallel version 
of the algorithm and present experimental results. 

Chapter 1 of the thesis reviews Rokhlm and Greenyard's algorithm and 
their use of complex variables for the two-dimensional problem and points 
out difficulties in applying the method iu three dimensions. 

Chapter 3 develops a three-dimensional ana^og^ie of Rokhlin and Grccn- 
gaxd's method. The basic idea is to expand the potentials in terms of spher- 
ical harmonics. The analytical basis, of the algorithm includes two theorems 
that derive the required expansions and give hounds on the error terms. 
There are also three lemmas that describe how to shift these expansions from 
one origin to another. Using these analytical res; nils, we present a sequential 
algorithm in three- dimensions, Independently of this work, a similar exten- 
sion of Rokhlin and GreenganTs method has been carried out by Greengard 
[GreSTj. His technique differs from ours in that he uses expansions in polar 
coordinates and we work in Cartesian coordinates. 

Chapter 4 concerns nmnerical verifieations of the three- dimensional al- 
gorithm. We first describe a sequential implementation of the algorithm. 
We discuss our choice of a tree-shaped data structure and describe heuris- 
tic methods for increasing the efficiency of the implementation.. We then 
present experimental results to demonstrate that the algorithm does have 
linear growth and does compute the forces and potentials to within any pre- 
specificd tolerance up to machine precision. We compare the speed of the 
algorithm with that of an C*(jV j ) direct force computation. For a required 
average accuracy of 10"* For potential fields T our algorithm will be faster than 
the direct force computation when there are more than L.dOO particles- 
Chapter 5 presents an extremely fast implementation of the A r -body 
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code— ft parallel implementation of the algorithm,- which runs in 0{log N) 
lime, implemented on the Connection Machine. We compare the experi- 
mental resulta with that of the sequential implementation and find that the 
parallel algorithm has a speed -up of 10 for N = l h OD0 and a speed-up of 100 
for N = lp 1 000- For a large parallel machine^ interprocessor communica- 
tion is often the bottleneck of the computation. By exploiting regularity and 
locality in communication patterns, by combining and delegating [Jiessages, 
we demonstrate how to minimise communication overhead due to message 
Congestion. 

Throughout this thesis, if not specified otherwise, jV will always refer to 
the total number of particles, in a simulation, and p will always refer to the 
highest degree of harmonics retained in an expansion. 



Chapter 2 

The A r -body Algorithm in Two 
Dimensions 



This chapter follows the presentation of the two-dimensional algorithm given 
by Rokhlin and Greeugard in [CjHSG]- We review the major theorems and 
give a heuristic description of the algorithm. The key idea is to use complex 
ar-alysi* to produce a series expansion of the potentials, 

2.1 The Multipole Expansion Method 

In the two-dimensional /V-body problem., we consider point charges and 
Coukmbic forces. For a charge q heated at the point y — (^1,^2) € R % * 
the potential at a point x = {x,,t^} ia qiog \\ x — y f| r If we identify R 2 with 
the complex; plane C 1 via (x\, jf 3 ) *-»*!+ ti^ h we can consider the potential 
to be the red part of the analytic function ^ : C 1 —*■ C 1 

£,(*)= *log(x-£). 

We can expand rtj(x) in a Taylor series that converge? for any £ with 
|x| >b|: 

M*) = *log{i - y) = q [log(*) + log (l - £)] = , [logfz) - £ i (£)* 

If we retain only jp terms of the expansion, the error is boundedt 
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Tikia observation leads to the following jwaJfipoJe ejjwnjtoii; Given an 
ensemble of m charges {9W1 = l,,..,,jn} located at points {|^,» = l,.».,m}> 
with IjfWj < rflh t h e potential ^(i)(«) at any point x € C 1 due to the charge 
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To obtain an approximation accurate lo within a given precision e, it suffices 
to Ktaift Only the first p terms in the multipole expansion (2,2) and to take 
p to he of the order [ — ^°S* e J ■ 

The ability to sum individual: expansions of (2.1) to obtain the multipole 
expansion of (2,2) depends on the fact that these expansions {$j,i)(x)ii = 
l h ...,m} have a common reference point and a common region of conver- 
gence. If the;- don't, we need to shift the series {^^(i), j = 1,...,™] Jo 
a common reference point. Also, to insure that all the shifted series have 
a common region of convergence, we may need to ^ip" a series- so that its 
region of convergence lies inside a disk, rather than outside a disk. Rokhlin 
and Oeeugard [Git86] derive shifting and flipping Lemmas that describe how 
to perform these operations in time independent of the total number of par- 
ticles, These lemmas aJlow us to manipulate the multipole expansions in a 
manner required by the following fast algorithm. 

2*2 Description of the Algorithm 

Rokhlin and Greengard's 0{ N) algorithm computes separately the potentials 
of close- by particles ("near- field potentials" ) and the potentials of far-away 
particles ("far- field potentials''). The near- field potentials are computed us- 
ing direct evaluation. Jf the number of particles near any given particle is 
>-imri-;'i by a small constant, that is, if the distribution r,f particles ia rela- 
tively uniform, then the work required to compute the near- field potential at 
this particle is of order 0{l), The far- field potential is obtained by evaluat- 
ing the p-term multipole expansion described above at the particle position, 
which takes constant number of operations for a prespecified p. The com- 
putation is organized so that the total amount of work for computing the 
potentials a& all the JV particles is of order O(N). 

More precisely, the two-dimension a] spa<;e under consideration is regarded 
as a square (Figure 2,1). It is divided intu four subsquares. and each of which 
is then recursively subdivided into four suh-subsquares, acid so on. This de- 
Composition is represented using a tree-like data Structure in which each 
square is represented by a node. A square at level I of the tree has four child 
nodes that represent the subsquares at level J -H 1. This recursive decompo- 
sition je continued Lmt.il there are only several particles in a square at finest 
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Figure 2.1: Deeompositio n in a two-dimensionaE *pate P with the corresponding 
data structure 



grain. Under the assumption that the particles are uniformly distributed 1 , 
the level of the tree n t i.e., the level of the decomposition, is usually chosen 
as about [log, N\ . Therefore each square at the finest level haa in average 
one or two particle in it. 

We define for a square its luv^fieid, far-field, and inttmctivf.field. The 
near>neld of a square consists of those neighboring squares that are at the 
satne level of decomposition as the square. In Figure 2.2, the squares labeled 
as B are in r.he near-field of square A. The far-field of a square is defined to be 
the exterior area of its near-field. The interactive- field of a square is the part 
of its far-field that is in the near-field of the square's parent. In Figure 2,2, 
the squares labeled as A J axe in the interactive- field for the square A. 

The goal of the computation is to compute the far-field potential expan- 
sions at all particle positions with 0{N) work. This is achieved by propa- 
gating values, first upwards through the tree, then downwards, as follows:- 

Initially, for each of the squares at the finest level, we compute the p-temj 
multipole expansion, valid outside the square, far the potential due to the 
charges inside the square. The expansion is performed relative to the center 
of the square. 



: Rokhliti [CGHB7] haa an Q{N) adapts t™ of the algprtibm which does not depend 
fur it? perfomiarjte nt\ the paitieJe dialiibutLoDS. 
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Figure 2/2: Near- field, far-field, and interactive-field 



These expansions .fire propagated upwards through the tree to produce, 
for every square at every level* a multipole expansion for the potentiat due 
to all charges located inside the square, The valid regit)]] of convergence for 
the multipole expansion is the far-field of the square. To form the multipole 
expansion for a square at level J, we Lake each of its Level- 1 -1-1 subsquaies., shift 
their multipole expansions to the center of the levet-J square and add them 
together. This is done recursively at each level of the tree while propagating 
upwards through the tree. 

The downward'propagation phase of the computation produces, for every 
square, a local potential expansion due to its far-field Interactions. The 
local expansion for a square is a multipole expansion that is valid inside the 
square. The computation proceeds recursively. Suppose we already have 
a local potential expansion <^ Ap (x) for square A's- parent Ap. The local 
expansion $4(1) for the square A is, by the definition of an interactive- field, 
the aum of <j^ p (i) and those multipole expansions due to particles in A'% 
interactive- field. However. o' Ay [z) in relative to the center of Ap- Tn order 
to combine it with other multipole expansions for the square ,4, we shaft 
$fo(&) to the center of ,4, We have> from the upward-propagation phase of 
the computation, the multipole expansion ^'(i) for each square ,4 r in A's 
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interactive-field. Id order to combine 4>a'{'^): we shift them to the center 
of A, and flip regions of convergence so that they have a common reference 
point and arc valid inside the square A. The sum of the shifted & Af {z) and 
4> A '(z) is the local expansion. 4' A (x) for the square A. This computation \r 
performed recursively all the way down to the leaven of the tree. 

Now we have for every square at the finest decomposition levct the local 
expansion due to its far-field valid in the square. To get the far-field po- 
tential at a given particle, we have only to evaluate the local expansion at 
the particle's position. The near-field potential jr. obtained by evaluating di- 
rectly interactions with particles in the near-field. The sum of the near-field 
potential and the far-field potential is the desired potential at the particle. 

Rokhlin and Greengard [GRSS] show that the total work required is 
jV(ap ! + bp + ck n + d) t where Jt„ is the upper- bound for the number of par- 
ticles io a square at the finest level. The key observation, in this estimate 
is that each shifting Or flipping of a J?- term multipoEe expansion takea 0{p 3 ) 
work, and that the number of squares in a given square's interactive- fietd is 
bounded by 27. The total number of squares at all levels ia 

3 3 

Therefore the upward- propagation phase and the downward- propagation 
phase of the algorithm accounts for the p 1 term in the estimate. The initial 
expansion step and the final evaluation step in the algorithm is responsible 
for the p term, and the direct computation on near-field potentials is for the 
K. terrm 

The overall computational complexity of the algorithm, from the above 
estimate, is of order O(JV), 

Rokhlin and Greengard [CRSci also derive that the error estimate in the 
computation is C(1/2J P+I , Given a precision requirement e, the number of 
terms p retained in the expansion la set as ["lofij^J- 

2.3 Remarks 

The key to the two-dimensional 0{ N) algorithm is that, for the potential ftx) 
due to charges at {#<■>, i = 1, ,.., m}, we evaluate the potential at each of the 
points {x* 3 \j = l, Jll+ n} by applying the muhlpole expansion of 4>(x], rather 
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than by directly evaluating and summing the individual potential* 4jh , ){s) at 
each x^K Our ability to apply this method relies on the fact that Int.- Taylor 

series expansions of the ^f>(r){i = 1 m} ate absolutely convergent power 

aeries whose coefficients are independent of x. We produce the expansion of 
$(x) by summing corresponding coefficients of the expansions (p^^i). In 
order to do ao, these expansions must have a Cuiiuiluii reference point and a 
common region of validity, or they can he shifted so thai they do to, 

The applicability of complex-analytic techniques in the two-dimensional 
case results from the fact that the two-dimensional potential, viewed as a real- 
valued function on IP* is harmonic, and thus is the real part of a complex- 
analytic function on C. In three dimensions „ the potential function is likewise 
bikTEnocuc. Un Fortunately, there are no corresponding complex-analytic tech- 
niques in three dimensions to simplify the analysis. 

In the nest chapter, we will produce expansions of the three-dimensional 
potential that are suitable to support the multipole expansion method- 



Chapter 3 

A three-dimensional A r -body 
algorithm 



We consider now the three-dimensional jV-body problem, where the gravita. 
tional potential ac a point x € fP due to a point mass ^ located at y e Ff 



i* 



|| dt - y || * ' 

Conlombie potential due to a point charge q has the same formula, except 
that jj is replaced by -q [Go!59] [ArnTS], 

In Order to apply the mult-ipc-le expansion method! we must expand this 
potential as an absolutely convergent aeries whose coefficients are indepen- 
dent of x, and we must produce an a priori bound on the error when we 
retain only a finite number of terms in the series. We must aJso he able to 
shift the series expansion to a new origin. 

Thi- .-if:r.,Mi:i;.| function in (3J) is harmonic in everywhere other than ;,■ 
[Kos64j. This suggests the possibility of expanding 4> y [x) 'in terms of spherical 
harmonics jfklobKJj. 

In this chapter we prove two theorems that derive the required expan- 
sions and give bounds on the error terms. There are also three lemmas that 
describe how to shift these expansions from one origin to another. [Jsing 
these analyticaj results, we presenE an analogue of Role him and Greengard's 
algorithm in three dimensions.. 

At the same time the author derived the results [Zha&T], Grcengard at 



:■ :.■■ 



Li 
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extended the two dimensional algorithm to three dimensions [GrefiT]. Hia 
method is similar to the one given here, but the underlying theorems and ex- 
pansions are different, Greengard works in terms of polar coordinates, while 
the formula!) And derivation!- here ire done in terms of Cartesian coordinates. 
In Chapter 4, we present experimental resnLts that demonstrate the CI* 
ror bound and nrder-of-growth estimates given here. However, there are no 
comparable published experimental results for the polar- coordinate form of 
the algorithm. It is unknown how the two different forms of ths Algorithm 
compare in practical imptementationE. 

3.1 The Multipole Expansion in Three Di- 
mensions 

Given an ensemble of particle masses {^ H V f Tj = 1,,„, m} located at points 
[yi n ) £ fl^n =■ l,..,.. T ni}, the gravitational potential at any given point 
x £ IP ia 

H«m rW K || x - jfti*) \y We wil] ahow how to expand 1/f* 1 ' into an 
absolutely convergent series 



'•' 



where Qiji(jf^) ia' independent of jf and depends only on j(™'. This will 
enable ua to get the multipole expansion for the potential 4f(z) due to all the 
masses {m^} by summing Logcthcr the aijjfcdj^l'a weighted by the masses 
fi n \ pW , u ., p* m > respectively: 

The force field is obtained by taking the gradient of the potential field 



7 = -v*{i) 
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The series expansion of (3.5) has the same coefficient!! as that of (3,4), only 
this time, Q,j h (x) h replaced by S^fa). This enables us to use the sime 

expansion formulas for potentials derived in the next sectioo to compute 
niultipolfl expansion;; for farces, 

3.2 Theorems 

This lection derbes necessary formulas; for producing and shifting multipole 
expansions in three dimensions, together with bounds on the error when 
retaining only a finite number of terms in the expansions. 




Figure 3.1: Gravitational field 

Figure 3.1 illustrates the quantities needed in our derivation, The poten- 
tial at point x £ R 3 is due to mass fi at po int y g ft *. j/° > is the reference 
point, and ^ as the arigJe between vectors &% and jrtfe Also there are 
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quantities r =|| s - y || h r fl =j| g - ^ || , and R =|[ r - ff W [|. We need the 
following relationships later in deriving the innltipolc expansion formnjiL 
From the basic triangle relationship, we have 

We -derive the following theorem which produces an expimsioa For 1/r in 
a region of analyticity outside a sphere. This is essential in obtaining the 
multipole expansion (3.4), 



Theorem 3.2.1 Using the quantities as shown in F^ffurz 3, l t if R > Pqj 

thai is, i/ip = (3Ft,ra,sa) measured relative to y^ ii ottiside ^e sp^rnr of 
radius r centered a£ yW, then 









^* = (M) w+ *-^*|pM, (3.8) 



and y Jn t^j, r^j nrp fJie dtrpdwii cosines ofy^iy. 

[f we retain only terms &fi+j+k<p then the crvor satisfies 



kl <C 



■":. 

R 



p+1 



(3.9) 



tu/iere C is independent of p. 



Proof. 

In Figure 3,1, the Cartesian coordinate distance between the points, x 
and y is 



r(sf,y) =|| a-y 



s 
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We expand l/rffr, y) as a Taylor series in powers of (^ - yf>) and get 



1 



J 



r(af hl r) r(* lff ) 



b =bW 



+ E 



lM 



a 



,1=1 %> 






(3.11) 
The subscript ^ = jr Cl=,? following line brace indicates that after the differenti- 
ation h done ^1,^,513 of point j should be replied by yl 05 , yj°',^ Dj of point 
JT (0;| . When x lies outside the sphere of radius r centered at yW\ the above 
Taylor scries converges absolutely and uniformly, we make the substitution 
jf = jf (0> before the- differentiation is performed. Using the identiti 



tjea 



%j? 



\t(x,ij)J dxp \r(x,y)J 



^ = 1,2,3 



liu- 



t 



r(x,y) 
the Taylor seriea (3.11) becomes 



W f 



I 



1 



*■(**>) - ,? 



-i*£f=£* 



- i 



f | 



£^£)*(i)- <-. 



Let ty = ^ - jr^ 'J /r . Since t^, ^ i/ 3 are the direction cosine* of jK*)y r it 
follows: that 



i>* 



3x0 






= £ 



!j: 



r^>^*- 



,/ ; 



t+>+i=r, ^ [ * J *8*\84P*&' 



iX.r. 



Substitute- this into [3. 12), we get 

1 oo „i+j+A 



#i+J+* 



K».>) ^ J iUlH^^&sfc^VsJ- < 3 ' 14 > 

This completes the proof for (3,7), 

Tn order to derive the error bound of {3,9). W will use the fact that 1/r 
can be expanded in powers of R u&ing the Legendf* Polynomials [AadSBj. 
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1 

T- 






n=0- rL 



< 1 



(3-15) 



- >1 



The polynomials F a (i) are called the Legendre Polynomials 
Pvix) = 1, P,(i) = r, P 2 (x) = i(3a a - 1), ... 

The series of (3.15) is absolutely convergent. Notice tha t the Lqf atMJW 

Polynomials jP w (cos7) vary with the directions of the vectors jf(°);r and yWy, 
and thus is a function of the point x at which the potential is to be evaluated^ 
as well as a function of the mass location y. 

Now we can complete the proof of (33). Using the notation 






■(£*£) 



the series of (3-12) becomes 



s=M^£G)' (3 - i5) 



Notice that there is a term- by-tcrm correspondence between this series and 
the series oF (3. 15). By equating terms, we have 



F a (cos 7 ) =(-!)< 



jp«_a*_7i 

a! ftfl 



$ 



(3.17) 



Surprisingly, 



Jfv+3 #* _ l 



a! #r£'Ji 



(— ) only depends on 7 , not R.. 



Using the correspondence between [3,16] and (3.15), we have 



[e| = 



>+-j+ i <p 



" £ *• 



a*+j+* 



:r + - ,_ - / 1 \ 
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v til! ..21 m 



i^=p+i 



Since the Le^eudre poEynomiak have the property of iP^icosj)] < 1, tlie 
error therefore is 



p+i 



1 K* 



R — T"(> I R 



:P+1 



M < E 

= c r ° 

This completes the proof. 



Using the result* derived in Theorem 3,2,1, vrc can produce themultipokr 
expansion of (3.4) in Section 3.1. As in Section 3.1, suppose that all m 
particle masses are located within a spherical region with radius r centered 
at yO, that £», f| jH - 9 PK) ||< rft , n , i, „, tm( we hiVB from Theorem 3.2.1 
the series expansion of 1/rM converging for any 2 with |[ * - p^ ||> r p 

J_ » _d*>+ k / 1 x 



flfid the error estiinale 






iC 



r ,.i 



-irCt 



p-i-r 



« = 1,,-^tn. 



The mullipole expansion of (3.4), therefore, is 

♦f xl - V — "' ' ( : ^ 
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where 






Furth 



crmore. 






ij,k 



'fcidxi9xi 



U*-iP>v 



£{? 



'o 



II * - jr <Q > II 



ui: 



where 



(3.LS) 



M 



c = E ^ 



m 



. i 



The fuJkwmj! tueorciu produces act expansion IJor 1/r in a spherical region 
of anaJylicity, which is the basis for flipping the region of anaJyticity ill 
shift-ings., aa required by the fast algorithm. 

Theorem 3.2.2 Using the Quantities in Figure .?. j again, if R < ro, that is, 

eenic-red of jf^ r fAen 

L * 



(3.20) 



ij'-i-O 



ttftertf 



,_,*** LIJ UJ UJ 

r D fwD $=Q ii«D 



(3.21) 



^ + ; + ft - a - <? - ^ (t - *)l(j - 0)P - 7 )5 

ft*f< A*l > a^j * ^3 one i/te dinecfi d ti c asiacs of Jf^ ' Jf - 

TAfi error bound for truncating term?? &fi + j ■+ k > p satisfies 



\*\<C 



Rf* 1 



(3.22) 
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Proof, 

Using the relationship in {3.S), we have 

1 i 



'IP+4 -2J?r «37 
1 



r<i 



\ 






(3-23) 



Denote by T lt T 3l r 3 the direction chines of i r W| a , 1 and Vm^^s tjie direction 
cosines of jrWy. Then 

cos 7 = 7] *v, + rji^ +■ r 3 ^ 



where 7 is the angle between pW* And yWy. This leads to the equality 
( — ) 2 - 2— cos 7 = J ? + J a + 4 g J 'frl + J^a + gjfrs 

If vre introduce the variables 

'1 ™ — *ta = — t h = — and 

ft = t,(ti - 2i\) t & = t a (t 2 - 2^), 7 = ^(fa _2^), (3.24) 

equation (£.23) becomes 



1 1 1 

— — 

r 



We can expand this as a Taylor series 
that is valid for |q + £ + 7 1 < L But 
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thus 

I If f -! \ (j_+j + t)i - v 

■* .jib v + j + *J *uw 

1 oo 

--E &U*W (3.28) 

whet* 

Substitute 0,^,7 with (1.24), the Taylor series of (3.26) becomes 



>.-.-. 



where 

LtI l^j l*j 

Write (3.27) explicitly in powers of £1,13, and i 5 

1 °° 

where 

°y* - J+J+.N-I'- 
r o 

As in Theorem 3--2-1, the error bouad for truncating Lerms of 1 + j 4- k > p 

in (3,20) satisfies 

]itr p+1 

■ j 

In the following lemma, we produce a (orrnuEa which shifts the center of 
a multipole expansion valid in a region of analyticity outside a sphere, 
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Figure 3.2: l_Emima 3.2.L 
Lemma 3,2.1 Given a muUipcIc expansion 






(*)■ 



(a.as) 



wtft napHrt fa jr (D) f valid in the ngum vutaid? the spfcrrc of radius r Q intend 

at jf<°>, supple we shift the reference point y t0) to y'W with \\ yt°) - tfW m = 
Ar, 09 ibm m %ur ¥ &£. JVaie *Aaf jp = || r _ ^ || , Thf . nRW muttipote 
expansion */ (&$#, twfcfc rtaperf to yfl*), ,- fl ^ region outside the sphere of 
radius r +■ -ir centered at j/'(°J is 



*E*) = £ ^* 



#+/+* 



;j^ a w Axfaxi 






(3.29) 



wAere 



^=EEE a^Uj_ w ^. (3,30) 

ft=0 i3=0 T=0 

0e« Jf = {Xi t X<t t X$} are the Cartesian coordinates of the point a; relative 
to ff'( D >, Wa^t are the coefficients in the expansion far- l/R with respect to 



,«fa) 
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Ttit error bound for retaining terms ofi+j + k<p in {3.29} satisfie./) 



\4<c 



if 



(3.31) 



Proof. 

Using Theorem 3,2. L we expand l//i, with respect to y^ a \ in the eegioa 
outride the sphere of radius Ar centered &\ j/^ a ' aa 



t 



^+j+^ 



J Iff) ' 



where 



R ]rl<*dX\dX$Xi 






■13.32;: 



*li T 3^ *«* direction cosines of y'(°)y( a ). Notice ttlfct after the shifting there 



i ^ 



JT,-^ = p»-j*» ,"= 1,2,3 



Therefore 



_5_ _£_ 
Substitute (3,32) into (3."2S) S we have 






that IS 



OH OQ 



JzU^dxtdxiajQ^K) 



^1+^+^+^+;+^ / 1 



a£ft=biJJi=njo yA i 3A i 0A 3 ^^ y 

which is valid outside the sphere of radius r$ ■+ Ar centered at y'W. Make 
substitutions of« + i = t' h /J+j = j', 7 + Jt « f,we gst 
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where 

i' / ■■' 

So far we have proved the expansion (1.29), 

Since the multipole expansion of (3,29) is unique in terms of the deriva- 
tives of I/A', we could estimate the error bound for truncation in (3.29) as 
if w* expanded it directly with respect to pW. From (5-19) we know that is 

Irvi 4- Af P+ 5 



,'!■■ 



We now show how to convert & raultipota expansion into a locaJ expansion 
valid within a spherical region of analyticity, by shifting the reference point. 
The region of analyticity is flipped. 




Figure 3,3: Lemma 3.2.2 



Lemma 3.2.2 Given a muftjjwfe expansion 



(3.3.1) 
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with reaped to y^, valid in tht region outside the Sphere of radius rjj Eentered 
at y i& \ suppose this time the reference point g^ ' is shifted Ar to y^ with 
Ar > re, an shown in Figure §.& The new multipoie. czptmeion of {3.B3}, 
isith rr.spe.ct to y^, this time inside a sphere, of radius [Ar — m!" centered 
at j'W is 

if* 

where 

1 '"■ 

j4gairt Jf = (J^ij A^-j f .Vj) arc (.hr ctwrTc/ziaaies of the point t relative, to '/'•' T> ■ 
1 1 j-jt -a™ the coefficients in the expansions for l/R with respect to jf^K 
The error bound for retaining terms oJi+j±k<pin (3.$$) is 



w\<c 



r: 



At — r 



;:-l 



(3,36) 



Here IF *s\\ x - tfW \\. 



'root 



FVom Theorem 3-2,2 we can expand 1/R t with respect to ^ Q ', inakfe a 
spherical region with radius (Ar — r<j) and center y^ as 



wh 



cro 



n 



-=- £ ^{Jf^f (3.37) 



f _ni+i +fc UJ LtJ Li J 

M^TT EES^ C3-3S) 

"o -r=0 /9=a &=ri 

/ -J \ (i + j + *-a-J- 7 )! 

V +i + ft - «- p -tJ (i - &y.(j - rfjr(t - vj! 

c «°) ( j /) (v) ( * i, " ,p,k,H,p,!,}w ' 
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Ti,T ZT rg are direction cosines of yWyW, Substitute (3,37) into (3.33) 









(i-«)>(/^0E(j.- 7 )! 



(3.3&) 



The region of analyticity is the inside of the sphere with radius [ Ar- r 9 ) and 
center at y'C°>, After making substitutions of f - a = i\j —ff = j\ jfc _ 7 = ^ 
we get 

(3.40) 



where 






•w 






wAfA? 


xl 


■Cv'j't' 


= . 


I 


oa 


vh'' 


+ &.-* r -l-J.f +- 


.IV 4- all 


<v 



Thifl completes the proof of (3.34). 

Similar to Lemma 3-2.1 , the error bound lor retaining only terms- of i + 
j + * < p in (3.34) is 



M<C 



vf 



Ar — r D 



p+i 



We atso derivfl a formula for shifting an local expansion within a spherical 
region of analyticity. 

Lemma 3,2.3 Given a iocai expansion 



milh respect fo jf< fl > r uuia'd in Mi! ne£i>n inside ihe sphere of radius r a centered 
at yffl, we expand GfrJ inside the sphere of radium r Q - At Entered ai y'M, 
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Figure 3-4: Lemma 3.2,3 
after jr^ t* shifted Ay tit J^ with &r < r as sAoicn in Figur? 3,j r #5 

Kifttt X = ( Jf i j X% j X 3) fAf nc if coordinates of the point 1 vrith respect to y'^j 
and 



dim ■ 51 c rtQ-,j+-j3,A+-v 



Mfttfrw*** ^ 



£ei% j'(*J - jfW = [Ax,, Az Si A^ 

Proof 

This can be derived by re-expanding the expansion 

in powers of Ai T Xi t Xa. No truncation errors are introduced in the calculi- 
Hon of 0.;:::. 
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Having derived the potential in the form of the multipole expansion 

wc recall from the equation (3,5) of Section 3.1 that the force is obtained by 



w litre 



Therefore, 



v= ( d 3 



dxi 



M 






■J-*- 11 *,j.t=0 ij.*=a 

3,3 A Sequential Algorithm 

In this section we give a complete description of the three-dimensional &. 
body algorithm. This algorithm is simiiar to the tWO-dimcnskuui. algor:!: !r : 
of Roklllin and G ree^-arti desr nhe<l hi (Chapter 2, however this time, we use 
the analytical results- obtained in the previous section. This formulation of 
the algorithm is for a sequential computer. We present a parallel version in 
Chapter 5. 

The throe-dimensional spare under consideration is. taken to be a cube 
(Figure 3.5). We apply the operation of subdividing a cube into eight identi- 
cal .aubcubes repeatedly, until a subcube has only a few particles in it. When 
the particle distribution is relatively uniform, the level of subdivision is ap- 
proximately n = [togg N\. By convention f we say that the initial cube is 
at Level 0, and the atomic cub&, i.e., cuhes at the finest level of the spatial 
refinement, are at level n. As in the two-dimensional algorithm, we define for 
a cube its near-field, far-field and interactive-field. The near- field of a cube 
is defined as consisting of those cubes, that are at the name level as the cube, 
and have distance to the center of the cube lets than \f% of the cube edge 
length; the far-fteld of the cube is the complement of its near-field and itself; 
and the interactive-field ot the cube is the part of its far-field contained iu 
its parent's near- field, 
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Figure 3.5'. Decomposition in a three-dimensional space 



For a. cube t at level I, we denote by *[{i) the- madtipole expansion, valid 
in the far- field of the cube, for the potential due to particle* In the cube; and 
by V^z) the local expansion, valid inside the cube, for the potential due to 
those particles in its far-held. Both ^•(■ T ) aJ1 ^ $(*) are expanded relative to 
the center of the cube. 

A description of the algorithm is .given in Figure 3.r>- n is the total levels 
of spatial refinement. 

Step (1) computes initially the mdtipole expansion **(x) for each atomic 
cube. This 5s obtained by adding together series expansions., relative to the 
center of the cube, far potentiate due to individual particles in the cube:. 

Step (2) computes the multipole expansion $(i) for each of the cubes at 
all intermediaJte levels in an upward manner. For a cube i at level I, the 
multipole expansion $4( z ) ia computed hy sijimrjini; £ j+, [j?) of cube Vs eight 
snbeubes that are shifted to the center of the cube L 

Step (3) computes the local expansion p[x) for each of the cubes. For 
a cube i at level i, the computation COttSkS&S of two parts. First it shifts 
fy (&) of cube r& parent to the center of cube a', which constitutes the far- 
field interaction. Then, it computes interaction with its interactive- field by 
summing ${r) of the intcractive^field that are shafted to the center of cube i- 
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The mtn of these two parts is the local expansion ${x) for the cube. This 
is recursively performed when walking down the refinement tree. 

Finally step (4) computes the desired putc-ntial for each of the particles. 
The far- field potential is obtained by evaluating $(*) of the atomic cube 
the particle belongs to at the particle position. The near- field potential is 
obtained by a direct computation. 

The error estimate for the algorithm is given by (a.31) and (3.36). By 
the definition of the near-field and far-field for each of the cubes, we have 
(r, + Ar>/JT < 1/2 in (3.31} and tff{&r - ro ) < 1/2 in (3,36). For a 
required accuracy of c, we choose p=[— logj ej . 

Let T 8 analyse the computational complexity of the algorithm, The total 
number of she cubes at all levels of the subdivision is 



+ 8 1 + ■ ■ ■ + 8 n = 



Step (1) takes 0[N) work to compute JV series expansions for all the N 
particles. Summing these expansions, to form £{#) for every atomic cube 
takes another O(JV) work. 

In step (2), it takes one shifting and seven summations of expansions to 
compute 4{x] for each of the cubes.. This takes constant amount of work for 
a prespecified p. Since the total number of the Cubes ia of Order J 'V" h the work 
to compute ${x) for all the Cubes in step (2) Is of order O(N). 

In order to compute tff» for each of the cubes in step (3)., we need to do 
one shifting on the parent t^jr) and one shifting For each of the cube in its 
near-held, and then sum the resulting expansions. Sinci: the number of cubes 
in each cube's interactive-field is bounded by 567, it takes a constant amount 
of work to compute y(x) for each of the cubes for a fixed p. Therefore step 
(3 J takes 0[N) work in total lo compute ^(t) for all the cubes. 

We have chosen the level * of the subdivision so that the number of par- 
ticles in the near^neld of each atomic cube a bounded by a small constant. 
Thus in step (4), for each of particles the direct computation on the inter- 
action with it* near-field takes work of Order 0(1}. Evaluation of the local 
expansion at the particle position takes again constant amount of work for a 
fixed p. Thus, step (4) takes Q(A r ) work to compute potentials for all the N 
particles. 

The overall complexity of the algorithm described in Figure 3.6, therefore, 
is the sum of those of step (1) through (4), that is, order 0(N). 
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(1) Initial expansion*: lor awl. atomic cube i" t compute ^(j). 

for I firam 1 to 8" with *t*p 1 do begin 

coapute for each of the parti-clee in the cube t tho 3<trlnn 

expansion fox tha patent ial due to the parttclt by lining. 

Theorem 3.2,1;; 

sua theaa eipanaiona for partidea in the cub* i to fpra 

Q*{i) for tne cub a i, 
end. 

{2) Upwnrd-pnth; for each of the tubes i at level I of the spatial Tefrnftmera, 
compute ${(x) in an upward manner. 

foT J *xom m - I to with ff«p —1 do begin 
for I fron 1 to 9 with step 1 do begin 

shift ^• C "+ , C?) of cub* t h * tight subcubea to the center o± 
the cube i by using Lemma 3.2,1; 

Sum tha shift id CJ !+J I>'I of cuts i'» eight aubcubea to foni 
^(i) for the cube :. 
end. 
end. 

(3) Downward- path: for each of the cubes i at level f, compute ^(a) in a 
downward manner. 

for I frnm 1 to n with step 5 do h^gln 

for i fron 1 to R L with step 1 do begin 

(3a.) Ehift ^ f-1 (T) C-f cube f'a parent cube to the center 

of tha cube i r by using Lemma S.2-3; 
(3b > shift £(*') ** *h« intaractive-f ield to the canter of 

cube r„ by using Lemma 3.2.2. 
sum the resulting ajpazifliorLS if (3a> and (3b) to form 
■&,■(#) for the cube i, 
end. 
end. 



Figure J.fc A sequential algorithm 
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{4} FitifiE evaluation: Fi>r ivich. of partk]es h compute the potentid at the particle. 

for particle p from 1 to jV with step 1 da begin 

(4a) evaluate $*{x) of the atomic cube i the particle p belongs 

to at tin particle position; 
(4b) compute directly interactions ffittl particUi Ln its nsar- 

fiald and in the cube i. 

add (4a) and (4b) as the desired potential for the particl* p„ 
end, 



figure 3.7: A sequential algorithm (can't) 



Chapter 4 

Numerical Verifications of the 
Three-dimensional Algorithm 



This cliapter numerical] j- verifies the three-dimensional algorithm of Chap- 
ter 3 with respect to the achieved accuracy and speed of the algorithm, We 
first describe a sequential implementation of the algorithm. We dtecuss our 
choice of a 3-D tree data .structure. and describe heuristic methods for in- 
creasing the efficiency of the implementation based on a detailed analysis on 
the time complexly of the algorithm. We then present experimental results 
to demonstrate that the algorithm does havn linear growth and does compute 
the- forces and potentials to within any prespecified tolerance up to machine 
precision. We also compare the speed of the algorithm with that of a direct 
computation and find that for a required average accuracy of lO - '* (or poten- 
tial fields our implementation will he faster when there are more than 1,000 
particles. 

4,1 A Sequential Implementation 

4.1.1 3-D Tr^c Data Structure 

As described in the previous chapter,, the algorithm requires the data struc- 
ture "used for the implementation to hierarchically decompose a three- 
dimensional space and to support operations such aa insertion, deletion, and 
searching. Therefore a 3-D tree data structure is an obvious choice, 

A 3-D tree [knuSl] is a balanced eightfold-branching tree, in which each 
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Level of nodes corresponds to one level of the three-dimensional spatial de- 
composition. The 3-D tree is an extension of the two dimensional tree- data 
struct ure discussed in Section 2.2 of Chapter 2. In the 2-D tree, a node 
and its four children correspond to a square and its four subsquares; in the 
3-D tree, a node and its eight children correspond to a cube and its eight 
subcubes. Note that the root of the 3-D tree corresponds to the original 
tube of space under consideration , while the leaves of the tree correspond 
ty atomic cubes. Si [ice each node of tile Lrcc corresponds to a cube in the 
spatial decomposition, we regard the word "node" and the word "cube 11 as 
interchangeable. From now on we will refer to a node as if we refer to the 
corresponding cube in three-dimensional space, and use the phrase '"the cen- 
ter of a node 11 instead of "the center the cube". Each node of the tree has 
pointers to its children, and pointers to its near-field and Eta interactivc-fidd, 
as defined before. The near-field and the interact ive-fietd for each node could 
be computed at run time, but in practice this is too expensive. In a long- 
term simulation, which iterates over many time steps, these fields would have 
to be recomputed again and again, Therefore, we initially establish static 
pointers to each node's near-field and interactive- field. This approach saves 
time but requires extra storage space for these pointers. 

Particles arc- initial y ^indncd 11 V=d '-uiics of :\ - tree according r... U.-ir 
positions. After each iteration of a simulation, information about particles, 
such as positions, velocities, etc. is updated- A particle is moved to a new 
node ]f it crosses boundaries of an atomic cube. Coefficients of expansions 
at various stages of the computation are stored in 3-D array.s held by each 
node. The hierarchical clustering of expansions is done by walking up and 
down the free. 

The levei of spatial refinement is chosen approximately as Eog B N. Thus, 
the 3'D tree is about logg jV levels deep. 

4,1.2 Implementation Details 

In this section we discuss the implementation of the algorithm on a Symbolics 
LISP machine, and describe techniques for increasing the efficiency of the 
implementation, 

Speed is the major concern in our implementation, One way to measure 
the efficiency of our implementation is to compare it with an implementation 
of the direct computation. For the first few implementations of our algorithm 
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on the LISP machine, the code rati extremely slowly. For p - 3, the crossover 
point where the running Lime of our method falls below the running time of 
the direct method was At ahoyt 10,000 particles. 

Performance monitoring revealed opportunities for both machine- 
independent and machine-dependent optimizations. 

We discuss the machine- independent optimizations first, Shifting expan- 
sions due to interactive- field potentials is very expensive. This accounts 
for most of the hidden constant in 0{N). For cacti node in the tree, we 
need to do 567 ahiftings on expansions., since each node has 567 nodes in 
its interactive-field. We can reduce this CQSt by grouping together nodes in 
the interactive-fields, More specifically, we replace eight child nodes by their 
common parent node (we call it a tupet-Aodc) if all eight nodes are in the 
interaction.- field of a single node. Using this, heuristic, each node has only 
140 nodes in its interactive- field, Numerical experiments, indicate that this 
modification speeds up the algorithm by a factor of about 8, 

Another source of inefficiency in the initial implementation was '.he re- 
dundancy in computing coefficients in shifting formulas. The repealed calcu- 
lation of factorials and permutations was removed by storing factorials and 
permutations in a table. There is. also repeated calculation of some com- 
mon factors in the formulas, VVe simplified this part of the computation 
by extracting common factors in juiu^ and hy canceling common factors 
appearing in successive stages of the computation. 

For the machine- dependent optimizations, we found that a major portion 
of the running time is spent on manipulating coefficients of expansions in 
shiftings and evaluations. The- bottleneck of the computation is the floating- 
point calculation and indexing of arrays storing coefficients of x'axious expan- 
sions. The array reference time was reduced by representing a 3-D array as 
a 2-B array of 1-D arrays, or as 1-D array of 2-D arrays. This minimizes 
the array reference overhead since 1-D and 2-D arrays are better supported 
on :!:■ r . T "- P ii . i . - - 1 i r _— , Lir," :. :,hl:: u oJ ilic I'^Li-niuve use of arrav ■vvrnnvr. the 
above improvement was significant. Other machine-dependent optimiaa-tioris 
included using tight loops in. frequently called procedures, and avoiding the 
creation of new arrays in manipulating expansions whenever possible, 

The above optimization:-, .s. really reduced both the time and storage re- 
quirements for the computation. The reduction in. storage in turn reduces 



L Ttie idea is due to Rich ZippeL. 
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the time due to disk paging, Overall, the machine- independent. Optimisations 
reduced running time by a factor of S and the machine-dependent optimiza- 
tions provided an additional factor of 4. The running time crossover point 
with the direct computation is now at about 1,000 for p = 3 (See Table 4.4), 

4*2 Experimental Verifications of the Algo- 
rithm 

4.2,1 Accuracy of the Algorithm 

In this S&Ction, we study the actual achieved accuracy of the al^ont' i;i, .unci 
compare it with the theoretical prediction given in Chapter 3. We first test 
Our algorithm on the Pythagorean configuration of three bodies and observe 
bow the- error of the computation scales with p. Then we test the algorithm on 
two typical distribution models, the uniform distribution mode] and PluinmeT 
distribution model, each with 1,000 particles , and again determine how the 
error varies with p. 

(a) The Pythagorean Configuration 

The- Pythagorean configuration of three bodies was first investigated bv 
C. Burrau in 1-013 JSP67J- It is Pythagorean not only in the geometric sense 
but also with respect to the masses. The sides of the triangle formed by the 
three bodies are 3, 4, and 5 and the masses of the three bodies are also 3, 
4, and 5. The configuration is such chat the body wstli mass j is situated 
at the vertex opposite to the side of L-e»j£L^ s, where £ is one of 3, 4, or 5 r 
The initial configuration of the problem and the coordinate system used are 
shown in Figure 4.L Initially the three hodies are situated in the z = plane 
and have speeds zero, consequently the motion is planar, 

We use our algorithm to compute the accelerations of the three bodies 
and observe how error scales With p. The experiment*] results are given in 
Table 4.1 and plotted in Figure 4.2, We define the relative eiror m each of 
the accelerations as 






M - 9 
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figure 4.1: Initial canfigu ration of the Pythagorean configuration of three bodies 

where o^efc^ej is the calculated, acceleration On body * Using our method, in 
single precisian arithmetic, and a;l' Vf is the actual acceleration on that body 
using the direct method offeree computation in double precision arithmetic 
and is therefor* accurate to the midline, round-off error of double precision 
arithmetic. 

In Table 4.1, Emo* is the maximum error of all the relative error-? in ac- 
celerations 
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c rmt is the root-tnean-wiuare error 
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where ti = 3 is the number of bodies in. the test, and tpjirtuiaj-eder^ is the 
:'-.■■" i - ■■ !"rror ::: toiaJ potential energy 
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Tdx,- 4. J: Accuracy test for the Pythagorean con' gu ration of three bodks. pis 
the highest degree of harm on i« retained in an expansion. 



Tabic 4.1 shows that the accuracy of the algorithm is improved by ap- 

proximately a factor of 2 when p is incremented by 1 few p greater Eh an 2. 
Consequent to the dots in Figure 4.2 are distribute J nearly on a Line, since 
the error axis is scaled logarithmieal ly. 

Note that the near-fietd in this implementation has been defined so that 
the ratio appearing in the error bounds (3,31 J and (3.36) of Chapter 3 is 
1/2, The factor of 2 decrease in errors for each increase in p is thus expected 
from the formal analysis. The results presented in Table 4,1 and plotted in 
Figure 4,2 match well with the predicted error bounds. 

The experimental results exhibit two phenomena that warrant further 
discussion. First, the error ^ iivt in individual accelerations does not de- 
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Figure 4.2: Plot of the accuracy test for the Pythagorean configuration of three 
bodies, p is the highest degree of harmonics retained in an Expansion, 



crease monotu :mally as p increases. Second, the calculated accelerations have 
nonaero components in the z-directjon, which is perpendicular to the plane 
of the three-body configuration, Below, we discuss how these phenomena 



arise. 



i0 



Nonmonotonic der.r^a^c of C r ^ ativv mth mcfflwiwj p 

The nonmonotonic decrease of ^tativt ^ or individual accelerations with 
increasing p is due to the fact that a multipole expansion is a mixed- sign 
Taylor aeries. The error due to truncating a multipole expansion is also a 
mixed-sign series, Consequently the error from, truncating the multipole 
expansion does not necessarily decrease monotoriically when retaining more 
terms in the multipole expansion , Nevertheless, che error in individual ac- 
celerations is always bounded by t mai in TabJc 4.L, which, like the predicted 
error bounds, decreases manotomcally as p increases, 

NotttttO i eotnpoTiGTit for accelerations 

The nonzero z component in our computation results from two kinds of 
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truncation: the primary truncation and the recstfdaffl frtraarthm. We note 
that each term of the multipole expansion in Lemma 3,2.2 of Chapter 3 
is approximated by a convergent Taylor series, which we wiEl call term se- 
riM. In computing the multipole expansion using Lemma 3,2,2, the primary 
truncation truncates, the multipole expansion and the secondary truncation 
truncates each term series . 

In the multipole expansion computed from Lemma 3.2.2, the coefficients 
in each dimension arc determined by parameters in alt three dimensions. If an 
infinite number of terms were retained in computing both the term series and 
the multipole expansion then the z component of the multipole expansion 
won Id he aero due to cancellation among nonzero z terms. Since only a finite 
number of terms are retained, the 2. component of the multipole expansion 
is not completely canceled. 

Intuitively, we might expect that retaining more and more terms in the 
term aeries would reduce in general the secondary truncation error in com- 
puting the multipole expansion and in particular the error in ^direction. But 
the experimental results show that the intuition is unfounded. The reason is 
that the multipole expansion and the term series are mixed-sign series. The 
fact that each of the terras in the multipole expansion is better approximated 
by the term series does not necessarily guarantee that the resulting truncated 
multipole expansion is more accurate. 

(b) Uniform Distribution of 1,000 Particles 

We verify the accuracy claim of our algorithm using a configuration in 
which particles are distributed uniformly. In this test 1 the system has t,0OO 
particles, each of which has mass 1/lflfHl. The results are shown in Table 4.2. 
The errors e raar+ £moi, and £ Jw n, r ,(j a j_. tfttf . S j, are defined as before. The results 
match well with the prediction. 

(e) Plummer Distribution of 1,000 Particles 

We also test our algorithm to determine if the accuracy of the algorithm 
is sensitive to the distribution of particles. The Plummer distribution is a 
nonuniform distribution with the density profile [HerSfi] 

, . IM rg 
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Tabic 4.2: Accuracy test far the uniform made!, p is the highest degree of 
harmonics retained in an expansion. 



which is Spherical and inotropic. M is the mass of the system, and r D js the 
scale- Length. 

In. the test the initial configuration of the system has two clusters, each, of 
which Is a Plummer system with 500 particles. Two clusters have the same 
M and r<> and are separated by 5r . A two-dimensional projection of this 
configuration is shown in Figure 4.3, 

The experimental results are qwii in Tabic 4.3. Again, they match very 
well with the theoretical prediction. 

In summary, the accuracy of the algorithm is demonstrated in Table 4.1, 
Table 4.2, and Table 4,3, These indicate that our method can compute forces 
and potentials to within any prespeciJled tolerance up to the machine round- 
off precision, which is about 7 decimal digits 'n single precision arithmetic 
[Reteo], As. we retain more and more terms in the multi pole expansions, 
the accuracy of our method improves, and wheal p = 20 the error of the 
computation is ftt the level of the machine round-off errors. 



4.2.2 Running Time of the Algorithm 

In this, section, we test the speed of our algorithm and experimentally 
determine how the running time grows with the number of particles. Wc also 
compare the results with those of a direct computation and determine the 
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Figure 4,3: Initial configuiatk>ri of two Plummer systems 
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Table 4.3: Accuracy test for the Plummer model, p h the highest degree of 
harmonies retained art an expansion. 
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Table 4.4: Speed test for the Multi pole-expansion algorithm with p=3. Results 
of a dined computation are also presented for the purpose of comparison. The 
running time excludes paging time. 
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Figure 4.4: Running time growth rate of the Multi pole-expansion algorithm, 
plotted against that of the direct computation. 
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running time gtohkjv«- point. In the test, particles are initially distributed 
uniformly within a spherical region. The space under consideration ia the 
cube of spate containing the spherical rcgion. 

The results of the test on the speed of the algorithm are given, in Table 4.4, 
and are plotted in Figure 4.4. The running time growth rate of the direct 
computation is also plotted for comparison. In the plot, both the running- 
time axia and the nnmber-of-particles. axis are staled logarithmically. The 
parameter p of the algorithm L» chosen as 3 and the calculated potential are 
accurate in average to about iLl -4 . 

Figure 4,4 shows that the running time of our sequential Implementation 
of the algorithm grows linearly with the number of particles. The jumps in 
the curve when the number of particles are 500 and 4.000 are due to the 
overhead of maintaining the tree structure when the tree grows one level 
deeper. The height of the tree grows Logarithmically with the number of 
particles to enforce the constraint that the number of particles in each atomic 
cube is bounded by a predetermined] constant. 

The running time crossover point of our algorithm with the direct com- 
putation is at about 1,000 particles. This means, that for a required average 
accuracy of lO" 4 for the potentials, our sentential algorithm is faster than 
direct computation when there are more than 1,000 particles. 

4,3 Discussion 

We have numerically verified the three-dimensional algorithm with respect 
to the achieved accuracy and speed of the algorithm in the previous sections. 
In this sec Lion, we discuss some additional issue* concerning the use of our 
algorithm, 

(a) I™* Overhead 

We note the overhead of maintaining the 3-D tree data structure in Sec- 
tion 4.2.2, The tree has a complicated pattern of links for the near- fields and 
the interactive-fields at each of the nodes. Much of the computation time is 
spent on tracing these pointers. When the tree is not fully loaded, this part 
of the overhead will dominate. This accounts for the jumps in the curve in 
Figure $A. 
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(b) Distrihutiem of Particles 

We hare assumed in Chapter 3 that the distribution of particles is rel- 
atively uniform within the region under consideration. For very clumpy 
distributions,, the performan ce of the algorithm will degenerate. If we still 
construct She decomposition tree with height of Llog a A r J r step (4) of the 
algorithm would take more than 0{N) time to compute all die near-field 
interactions. Otherwise, to maintain the condition l:i\:{ the cm ruber or parti- 
cles in each atomic cube ia bounded by a predetermined constant, we would 
have to construct the tree with height of O(N) in the worst case. Then 
step (2) an<f step (!?} of the algorithm in Figure ,*J.7 of Chapter 3 would take 
time exponential in N to compute all the multipole expansions. One way to 
improve the efficiency is to dynamically prune empty tree- branches, that is-, 
if no particles arc founded in a branch, the branch will be pruned from the 
tree. 

(c) Choice o/p 

We have cSemonstrated in Section 4,2,1 that the accuracy of our method 
improves as we retain more and more terms in the multipole expansions. 
How big should the parameter p be in practice? This question depends on 
the specific nature of a problem. The goal of a numerical simulation ia dot 
alwF.ys "accuracy 7 ' in a strictly mathematical sense, but rather ''fidelity* to 
the underlying physics bl a sense that is looser and more pragmatic [Pft8o"], 
Often, as in gaJaxy simulations., the conserved system quantities such as total 
energy and total momentum, are of more interest, so monitoring on these 
quantities, will better reflect how good an. algorithm k. Sometimes we only 
want to look at the statistical behuvjur yf a system. In r>uch context, some 
types of errors are much more tolerable than others. Another reason is that 
the error introduced by the discrete integration time step of a simulation is. 
itself comparable with that of truncation [AppS5]- Therefore choosing small 
p often suffices. This will greatly reduce the constant factor in our algorithm. 



Chapter 5 

A Parallel Algorithm 



5.1 Introduction 

It is clear that the JV-body me thud of Chapter 3 can take significant ad- 
vantage of paralleiism. With a, massively parallel computer, where? we can 
allocate a separate processing element for each particle, we can compute from 
expansions the potentials for all of the particles in one parallel step. We can 
also propagate value* up and down the spatial-decomposition tree, as in steps 
(2) and (3) of the algorithm in Figure 3.G, in parallel , generating expansions 
for all the nodes at a given level in one parallel step. The depth of the tree 
grows as log JV, so a complete propagation will require time at least on the 
order of log A r , 

In this chapter, we present an extremely fast implementation of the N- 
body code— a parallel implementation of the algorithm, whose measured run- 
ning time does indeed grow as 0(]o S -V). The code runs on the Connection 
Machine, which is a massively parallel SIMD computer, consisting of up to 
65,536 processors, each with its own local memory, connected in a commu- 
nications network. On the Connection Machine model CM-2 that we have 
been using, each processor is a one>bit ALU with floating- point unit and 64K 
bits of local memory. 

In designing a practical algorithm for a large parallel machine, there are 
two issues that mmt be addressed in addition to the issue of deciding which 
steps should be performed concurrently. For many large parallel computa- 
tions, local computations at each processing element are relatively cheap, 
while communications among processing elements are expensive. The bot- 

45 
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tleneck of the computation is the interprocessor communication. Thus,, an 
efficient program must be designed to minimize both the need for interpro» 
cesser communication and the contention generated by whatever communi- 
cation is required. For the parallel jV-body code, we exploit the locality of 
the algorithm to reduce the need lor communication. In addition wc exploit 
the regularity in the communication patterns of the algorithm in order to 
substantially reduce contention for the underlying communication resource. 

5*2 Why a Parallel Computer? 

In the tree walking of our sequential algorithm, computations on expansions 
could be done concurrently within each level of the tree. Tremendous speed- 
up can. he ejected if we- exploit this parallelism. The question is how to- map 
our problem onto a parallel machine, either a machine specially designed 
for this problem Or a commercially available one, such as the Connection 
Machine. 

Let us take a close look at the sequential algorithm we developed in Chap- 
ter 3. First j the tree computijitiu:: has regularity over ill nodes at each level of 
the tree, A SIMD machine suffices for computation of thin pattern. Second-, 
the tree computation has concurrency. Computation at each of the nodes 
within a single level of the tree can proceed at the same time, Third, the 
tree computation has spatial locality. In the upward- path of tree walking 
each of the nodes talks only to its parent node. In the downward -pain of 
tree walking, each of the nodes talks not only vertically to its parent., but 
also laterally to those in its interactive- field. The lateral communication with 
a node's interactive' field is locai to a subtree rooted the nodc 1 s predecessor 
four levels Up in the tree. Depending on how we construct the tree on a 
parallel machine, we might be able to exploit this locality. Lastly, the tree 
computation has dependencies among levels of the tree. More specifically, 
computation at one level of the tree cannot proceed unless the computation 
one level up or down is finished. This determines that the optimal perfor- 
mance we can achieve from the tree computation is proportional to the height 
of the tree, that is, 0(Eog jV), where W is the number of leaves in the tree. 

We choose the Connection Machine to implement a parallel version of the 
algorithm developed in Chapter 3. 
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5.3 3-D Tree on the Parallel Computer 

We use a 3-D tree as our data structure, as in the sequential implementation. 
The goal of the algorithm design is to distribute the components of the 3-D 
tree structure among many processors so that the concurrency in the tree 
computation ran be maximuqily exploited . 

We allocate a processor to each node of the tree. The upward-path and 
downward- path of the tree walking of I be algorithm require that each node 
in the tree has links to its patent as well as to its children. Explicitly storing 
these links would be expensive, since each node has one parent and eight 
children. However, since the tree has a fixed branching factor, a regular 
allocation of processors for the tree will impose an impHc.it relationship among 
the processor addresses, as long as tbe tree IS allocated within a well-defined 
region of processors. This scheme is called the address-induced representation 
of the tree structure [HilSS] [ChrS*]. 

More precisely, we allocate consecutive processors to nodes, starting at 
the root of the tree. We first make a sequential breadth-first scan of the tree 
to be constructed., and label each node as we encounter it, starting from 
at the. root. Then we assign to every node a processor {or a fixed number 
of processors) whose address is the breadth-first scan label of the node. We 
therefore have a formula to induce the parent- child relationship cor every 
processor in the tree, Given the address t of a processor, we know that its 
parent processor has address of [(i - i)/aj l and its chiid processors have 
addresses from Si + Uo M + S. Hy representing the tree in this way, we avoid 
having to store pointers at every tree node. However, the saving in storage 
is achieved at the cost of computing the tree structure every time we trace 
pointers. 

We alao allocate a processor to each of the particles in a test ' . Each leaf 
node of :be tres remembers addresses of those particles that belong to the 
node Spatially, so that those particles can he accessed simply by referring 
to the leaf node during the computation of its near-field. Similarly those 
particles also remember the leaf node for the initial expansion stage of the 
computation. This representation of particle- leaf- node liliks has An additional 
advantage. After every time step of the simulation, the tree may he updated 
for particles that move to new cells. This process is easy, since only particlc- 

l h a possible that the allocation of the tree uverLnpj with that of the f^rtidra. 



CHAPTERS. A PARALLEL ALGORITHM 48 

leal- node Links need to be updated, 

S+4 A Parallel Algorithm 

Id this section, we will describe a parallel implementation of the algorithm 
we developed in Chapter 3. The basic patterns of corcimunieation include 
passing and combining data among processors vertically and laLcrally with 
respect to the embedded tree structure. The description of the algorithm is 
summarized in Figure 5.1, Notations are used in the same way as that in the 
sequeci'.iul dlffurithm ■uf bi^un.- 'i ■:>. 1 Jiv :. umber oi levels in Uiu ".re.' ll r;. 

In Step (1) of the algorithm! each node of the tree is mapped Co a proces^ 
aor Since the tree only needa to be built oncej, r.ln: sequential construction i:; 
not a serious drawback as long as we amortize its cost over many time frames 
in a simulation. 

Step (2) finds lor every node of the tree its near- field and interactive- 
field nodes. This could be done at run time, since the addresses for these 
nodes take a fair amount of memory spaces. But this- means that we have 
to find these nodes every time we want to access them. We know that there 
are SI nodes in a node's near-field, and 567 nodes (using heuristics this can 
be- reduced to 140 nodes) in each interactive- field (we ignore the boundary 
conditions). Experiments have shown that the calculation of these fields is 
very expensive. So we precomp-utc these fields and store them, Step (3) 
inserts ?.ll the particles into the tree. 

Step (4) initially expands potentials for all particles and forms the ex- 
pansions * for leaf nodes of the tree. An expansion for the potential of a 
particle is valid outside the leaf node the particle belongs to,, and has the 
center of the node as the reference point. Ail expansions of a single leaf node 
are added together. This is done all in parallel. ($*s and ^ : s arc expansions 
as defined in Section 3,3;, and are represented as arrays of parallel variables.} 

Step (5) implements the upward nath of the tree walking, which computes 
$ for every node of the tree. Th* computation proceeds in this way: nodes at 
one level of the tree in parallel shift the $ h s to the centers of their respective 
parents, combine tne shifted ty's- according to whether they have the same 
parents, and send the results to the parents. 

Step (6) walks down the tree and computes $'a for every node of the tree. 
Every node in parallel fetches * at its parent node, and shifts it to its center. 
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(1) Tree fiio hiding: allocate one proce&sot for every tree node, starting at the 
rciur. 

allocate proc.fli8.flgr for the, root; 
for i frotfl 1 to n do begin 

fur j iron i to S 1 do begin 

allocate, processor S^S" -3 -1J/7 + J tor the j*h to the 
leftmost nod* at lexel 1' of the tree. 
end. 
end. 



(2) Find for each of nodes Its near-field and interact ive.fl*ld nod 



£?: 



for all nodes of the tree in parotid do begin 

each node finds its near-field and interactive-field nodas 

uid etorea the address aa. 
end. 

(3) Insert xV particles: 

for i fraai 1 to N do begin 

insert particle j" into the tree , 

(4) Initial expansion at leaf nodes: 

for nil particles irt parallel do begin 

compute expansion * far the potential due t« each of particles 

relative to caster of the l» a f node the particle belongs to; 
add together those * J s due to particles of the awua l &a f node 
and form $ fox that leaf node, 
end. 



Figure 5.1: A pa;alld algorithm 
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(5) Upward- path, of tree walking: 

far t from n dcnm to 1 do begin 

for alf oodM at 1*y*1 * i?j p^-titlel i3o begin 
at each node 

shift the $ to the center of its parent node; 
CUB the resulted expansions, of those who have the 
same parent node and form $ for that parent node. 
end. 
•nil. 

(6) Downward-path of tree walking: 

for i from 1 to n do begin 

for ail n^des at Level r in parvilci do begin 
at each nods dl 

t6a) shift. Tir at dl J s parent node to til's center; 
{6b) for node d2 in dl J s interactive-field, do begin 
ohlit 4 *t tf2 to tfl's center; 
end. 

add the srdft*d £'a together - 
add (6a) and (6b) to form $ for node dl . 
end. 

(7) Final evaluations compute near-field and far-field interact ion* fftt each -of 
parddes. 

for fiit particle* m paiutfeJ do begin 
at each, particle 

(7a) *valuata at the particle position the ¥ of the leaf 

node the particle belongs to; 
(7b) compute diraetly potentials du* to it* Mar-fiald. 
add (7a) and (7b) t» fern the desired potential, 
end. 



Figure 5,2: A parallel algorithm (ee-n't) 
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This is done level by lewd, as in Step (5), 

Finally, step (7) finishes the computation. Every particle evaluate, in 
parallel the corresponding *'s it the particle position. This part of the 
potential is due to the interaction with iu far-field, Then every particle 
computes in parallel the near- field interactions directly. The sum of the 
far- field and near- field interactions is the desired potential at this particle 
position. 

We have seen from the above description that computation at a given 
level of the tree in steps (4)-(7) takea one parallel step. Thus, step (4) 
and step (7) each take constant time. Since the tree has depth of log A r , 
and computations at different levels of the tree have dependencies among 
them, a complete propagation in step (0) and step (6) takes Of log jV) time. 
Therefore, the time complexity of the parallel algorithm, excluding the initial 
set-up time for the tree and communication costs, is Q(logN). 

Let us now look at the memory usage in the computation. When we 
retain in an expansion only the terms of spherical harmonics with degree less 
than four, the total number of terms m an expansion is 20. As verified by the 
experimental results from the sequential implementation, this guarantees in 
average a four decimal-digit accuracy in potentials. Each of the coefficients 
in an expansion is a 32-blt single precision floating-point number. We have 
three different arrays of coefficients for each node of the tree in the computa- 
tion. Thus expansions alone take about 2K bits of memory at each of nodes- 
Storing the interactive-field and nea*- field takes another 4K bits of memory! 
We also need aome stack space for intermediate computation. Therefore, 
about 10K bits of local memory at each node suffice for our purpose In case 
a machine has insufficient amount of local memory per processor, we can 
remedy this by simulating a node of the tree with several processors. In such 
a case, however, the cost of the communication overhead would be high. 

5.5 Experimental Results 

We have implemented the parallel algorithm on the Connection Machine. 
The experimental results are summarized in Table 5.1. The running time 
growth rate of the implementation is plotted in Figure ,y. 3, along with that 
of the sequential implementation. The results experimentally verify that the 
parallel implementation of the algorithm on the Connection Machine scales 
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Table 5.1i Experimental resultion the C&nncction Machine. The running time 
exclude paging time. 
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Figure 5.3: Running time growth rate of the parallel algorithm, With that of th« 
sequential implementation, 
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logarithmically in the total dumber of particle in the simulation, even when 
we take the communication into the consideration. We find that the pallid 
algorithm tuna faster than the sequential algorithm, even on small exam- 
ples. Compared with the sequential implementation,, the parallel algorithm 
exhibits a. speed-up factor of about 10 for AF=l,(HrQ And a speed-up factor of 
about 100 for N=VX$XP. 

The complexity issue of the parallel algorithm, which » complicated by 
the need for communication, wilt be discussed in. more detail in the following 
sect ton . 



5-fi Communication Patterns 

We have discussed the complexity issue due to the floating-point compu- 
tation in. Section 5.4. Our experiments have shown that the float.uig-pc.jnG 
computation takes about 2d% of the total running time and the interprocee- 
sor communication COnaumeH a major portion of the rest 3 . 

A careful Study at the communication patterns reveals that there are 
two major kinds of communication going on in the computation. The first 
kind of communication occurs between adjacent levels of the tree. We call 
this vertical communication , The second kind of communication occurs 
between nodea and their interactive-fields and near- fields, and takes place 
within a single level of the tree. We call this lateral communication 4 . The 
experimental results show that, when the tr** is avcrageEy loaded, the lateral 
communication constitutes about 20% of the total running time, and the 
vertical communication accounts for another 25%. In the following -sections, 
we describe various optimisations we have made to reduce the communication 
overhead- 



The parallel computation has about 200 million fWimg-pouu «IguIbIkhi9 in a test of 
1,000 particle,, and has stent 1,«H million tWing-point filiation* in a teat uf L0.1XH) 
particles. 

a Th# iesuita are ahtaiiKd an ih<! Couneetlon Machine lutqg the meteiinji urocam 
provided by Shawn Mclean, 

< Recall from S«tian 4,1.2 that a mpm-aade » a node farmed by gMiipLag on 
int<r«t]ve.ri«ld nodes. Communicate with a aupennod* in the KirraJ communica- 
tion involve two adjacent fcvda of th.e lh*. Although this situation k Btnulat to the 
vertical wmmiKi Nation, for the sata of simplicity we wilt consider it lo be the lateral 
communication. 
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5.6.1 Reducing Communication Bottlenecks 

The lateral communication required for computing interactions with near- 
fields ami interactive-fields is one of the most expensive parts of the parallel 
compulation. The coat of the lateral communication is due largely to mea- 
sqjc coliiffions in which many processors try to access a single processor at 
the same time. Since processor access on the Connection Machine is handled 
m an exclusive reading and exclusive writing way, the time of a successful 
message routing is proportional to the maximum number of message colli- 
sions. 

We tried to ease the communication bottleneck by reducing the number 
of collisions in the lateral communication. We know that each node has 140 
nodes in its interactive- held, and those nodes are represented as a list of 
addresses. A close look shows that many processors often have the same 
node as their interactive- field node at the same time, due to the way the 
interactive-field list is- constructed. Our algorithm, therefore, randomises 
the ord«r in which one's interactive-field nodes nodes are accessed,, In the 
hope that two processors are less likely to access a single processor at the 
•■niiui :it:ie. A^ a result the randomization 1:il::u-ov'-* lo'- ^TlorniiLno:- ,;f \'.x 
algorithm. 

5.6.2 Localizing Communication 

In computing the interaction with a node's interactive-field., we need to use 
Coefficients of expansions stored in an interactive- field node again and again. 
Instead of fetching these every time we use them, we fetch them once and 
store them on a temporary local stack, This can reduce the traffic by a factor 
of 5. when computing expansions whose highest degree is three, 

5.6.3 Combining and Delegating Messages 

The Connection Machine router has fairly good performance when the net- 
work is averagely loaded, hut the rouunj rime scales up as the number of 
collisions in the touting process, Collisions are mostly due to Concurrent read 
or concurrent write to a single processor. To reduce the collisions, we can use 
the tfcas operation, which does the segirreuted parallel prefix computation on 
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the Connection Machine 5 . To resolve collision due to the concurrent read, 
we first do am exclusive read^ and then do a acaii-copy operation. In cane of 
the concurrent write combined with addition, ue do a scan-+ and then an 
exclusive write instead. If elements of a. segment to be scanned are not in 
consecutive processors, in order to apply the scan operator we have to first 
project the element* to a aevf segment so that the new segment h&* elements 
in consecutive processors. This extra projection pays off when there are more 
than thirty processors in a segment. 




Figure 5.4: Concurrent write 

In our implementation of the algorithm, the final direct interactions with 
a node 1 * near-field often gives rise to 10-20 collisions. Therefore we use 
a different method to reduce the collisions. The idea is to structure the 
message routing through a tree that distributes the message collisions over 
many intermediate nodes. This tree-like routing Allows many of the collisions 
to be handled in parallel, thereby reducing the total time delay due to the 
collisions, 

Consider an example in which 16 processors access a single processor for 
concurrent write (Figure 5.4). Using the tree routing scheme outlined, above, 

*Give& a binary associative optralpr * and a Kgrnent of sequence of Omenta 
H.zj, ...,*„,, *ru? paxaJlel prefix operation computes Jr u i t * x t , ..., n *i, 4 ,.. x 
in. Of loaf m}) Linn [UJSflJ. In particular, thrrc are &tan*+ «id sea n.t» n operat iem Tta 
jc™ operation 1; vary elflci*nlfr implemented on the Cn^nectton Machine. For simple 
bjftwv operation* ttich as + and cvpv, it tafea the same order of time *& a renting eyde 
J&ea, which is the fundamental unit for time nHuuKnwDt, and ll therefor* considered to 
t«Jcc constant amount of tijii*. 
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Figure ft.rt: Combining messages 




Figun? 5.6: Delegating messages 
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we can use a two- level tree to combine every four of the writes together, and 
then combine every f our ( the resulting writes (Figure S,5). Even though 
there are 20 total collisions in this tree, the time delay is only proportional 
to S collisions because the nodes at the Intermediate level of the tm> handle 
tbeir collisions in parallel. This process is called meaange-tombitimg. The 
dual process, message-delegating, reduces collisions due to concurrent read 
as illustrated, in Figure 5.6. 

In general, suppose there are N collision* in total. Each stage of combin- 
ing; or delegating reduces the number of messages by a factor of m. Therefore 
the number of stages is log m A r . What is the optimal value of m such, thai 
the function 

Routing- time(f7i,jV) = mlog^fjV) 

is minimal? Using elementary calculus, we find that the optimal m = t 
where e h the base of the natural logarithm 2.7 IS... Of cou^e, we must 
actually choose an integral value for m (2 or 3), 

Message-combining and message-delegating have been very effective, and 
in practice have given a Factor of 2 improvement in speed. 

5.7 Discussion 

In this section, we discuss traders we made in our parallel implementation, 
and suggest alternatives to this implementation. 

5.7,1 Grid Representation 

We ha*c used, an address-induced scheme to construct the tree on the Con- 
nection Machine. This mapping scheme has the merit of simplicity. It also 
saves memory space, by avoiding explicitly storing pointers in the" tree. As 
a result of this consecutive mapping scheme, only a Subset of processors are 
active at a Lime, since th« computation proceeds level by level in the tree. 

The alternative to the address^induced scheme is to use the Connection 
Machine grid-like communication network called the NEWS grid. This struc- 
ture provides fast communication for local or highly structured communica- 
tion patterns [Hil35j. Every processor is assigned a grid coordinate and can 
be addressed by specifying this coordinate. The tree can he embedded in the 
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Figure 5-7: Superimposed mapping. Each Square- in the grid represents a pro- 
cessor, 3nd each number represents a node in the tree. 
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network in two different ways. One way is to superimpose levels of the tree 
SO thai nodes fit one level may share processors with those at Other levels as 
illustrated in Figure 5.7. As i reeult n some processors require more memory 
than others do. This scheme makes the processor utilization at most 100%. 
However, since all processors on the Connection Machine must have the same 
amount of local memory, the nonuniform memory requirement across proces- 
sors wastes memory in gome processors. The alternative is to unfold levels 
of the tree and to assign to every node a processor, as in Figure o.§. This 
makes the memory utilization at most 100ft but wastes some processors. 

By using the NEWS grid, we can take advantage of the fast grid routing 
for highly Structured communication patterns, such as the communication 
with interactive-fields discussed below, 

5.T.2 Exploiting Regularities in Communication Pat- 
terns 

To further reduce the number of collisions discussed in Section 3,6,]., Alan 
Ruttenberg has suggested a scheme to exploit the regularities in. the commu- 
nication patterns. We have not implemented this, so we do not know how 
much it decreases the running time of the algorithm. To keep the discussion 
simple, we present the scheme in two dimensions. The extension to three 
dimensions is straightforward, 

We classify Squares of the spatial decomposition into Four classes, Suppose 
ill the squares of smallest siEe of Figure 5 J are at level I. We first group 
every four squares that share a common parent at level I- 1. Then we classify 
the four squares of a group as class 1, 2 ,3« and 4 respectively according to 
the relative position of a square in the group, *s labeled in the Figure 5.9. 

We find that for every group there is- spatial symmetry among the 
interactive-fields of its four squares. In Figure 5.10, square a' is an interactive- 
field square of square «_ A clockwise rotation of 90 degrees of the square 
a' with respect to the point O results in another square 4\ which is an 
interactive- field square of square *> and so on. As a result of the spatial 
symmetry, the four squares a\ ^, ^ and ^ are in different da^es, as shown 
m Figure 5.10, This symmetry property can be used fco completely eliminate 
message collisions in the communication with interactive-field squares. The 
idea is as follows: each of four squares o, fc, c , and d interacts with Squares 
a , tf, c\ and <P respectively at the same time so that the communication 
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Figure 5-10- Symmetric communication pattern 
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pattern is symmetric with respect to four squares; the same is true for alt 
other groups. Since the four squares a\ bf, c\ and tf are all La different classes 
and communication patterns of two different groups are identical except that 
they are shifted from each other hy some distance, none of the four squares 
a\ V, c', and d' will be fetched by more than one square at a time. Thus, 
lateral communication can take place without collisions, 

III Section 4.1.2, we discussed super nodes itl the mfceractive-ndds. Since 
the super nodes are one level higher in the decomposition tree than those 
ordinary nodes in the interactive- field, interactions with a super node using 
the above scheme will give rise to copious. However, the maximum number 
of interactions to a super node will not exceed 4 at a time. Therefore the 
collisions can he avoided by spreading the data to be fetched at a super node 
to its four children, and afterwards, access to the super node will he directed 
to the four children. 

In the above discussion we also ignored the boundary case. We can intro- 
duce dummy squares outside the boundaries so that boundary squares still 
preserve the symmetry property of interactive-fields. 

5.8 Conclusion 

We have described the parallel implementation of the three-dimensional N- 
body algorithm thae runs on the Connection Machine in order G(log jV) time 
in this chapter. Compared with the previous A r -body methods, our method 
has a demonstrated advantage in simulating large number of particles. The 
parallel implementation achieves tremendous speedup in running time and 
computes the forces and potentials to within any prespeci&ed tolerance up 
to machine precision, 

Combining the results of this chapter with the analytical results derived 
«n Chapter 3 and the experimental results of the sequential implementation in 
Chapter 4, we have presented a complete analysis of the three-dimensional 
A -body algorithm both theoretically and experimentally. Because pf the 
superior sp«d and accuracy of our algorithm, we expect that tt will find 
applications in astrophysics, plasma physics, fin in dynamics, and molecular 
dynamics. Nevertheless, there are additional improvements that we believe 
wJl enhance the practicality of the algorithm and, therefore, are worth fur- 
ther research effort. 
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Wic have observed in Section 5,d that local computation — mainly floating- 
point calculation at each processing cteniecLt of the parallel implementation — 
takes only about 20% of total running time. The rest of the running time 
is meetly spent on communication among processing elements. The results 
indicate that the bottleneck is the interproccssor commm] Leal ion. Therefore, 
an efficient implementation of the algorithm should explore the communica- 
tion requi rem ants- of the algorithm. 

Ill Section 5.7 we dfiarxibed alternatives to the current paraJle] implemen- 
tation to improve the performance of the algorithm. By using the NEWS 
grid h we can exploit the localities of the algorithm and take advantage of the 
fast grid routing. The use of the NEWS grid will also enable us to exploit the 
regularities in the lateral communication of the algorithm. We expect that 
the resulting implementation will ease the bottleneck of the interprocessor 
communication. 
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